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We analyze a series of publicly available controlled experiments (Latin square) on Affymetrix high 
density oligonucleotide microarrays using a simple physical model of the hybridization process. We 
plot for each gene the signal intensity versus the hybridization free energy of RNA/DNA duplexes 
in solution, for perfect matching and mismatching probes. Both values tend to align on a single 
master curve in good agreement with Langmuir adsorption theory, provided one takes into account 
the decrease of the effective target concentration due to target-target hybridization in solution. 
We give an example of a deviation from the expected thermodynamical behavior for the probe set 
1091_at due to annotation problems, i.e. the surface-bound probe is not the exact complement of 
the target RNA sequence, because of errors present in public databases at the time when the array 
was designed. We show that the parametrization of the experimental data with RNA/DNA free 
energy improves the quality of the fits and enhances the stability of the fitting parameters compared 
to previous studies. 

PACS numbers: 87.15.-v,82.39.Pj 



I. INTRODUCTION 

DNA microarrays 0, Q are devices capable of measur- 
ing the gene expression levels on a genome- wide scale and 
are based on the hybridization between surface-bound 
DNA sequences (the probes) and DNA, or RNA, se- 
quences in solution (the targets). While the specificity 
of the interaction between complementary base pairs A- 
T and C-G suggests that the hybridization of a single 
stranded DNA target with its perfect matching probe 
would be dominant, often strong non-complementary 
hybridization effects are observed (see Figure As 
the targets are fiuorescently labeled, the amount of hy- 
bridized DNA from each probe can be determined from 
optical measurements. The presence of strong cross- 
hybridizations is one of the reasons why one cannot in- 
terpret the fluorescent light intensities as direct measures 
of the gene expression levels. 

One of the most popular platform for DNA microar- 
rays is provided by Affymetrix [2j, which produces high- 
density oligonucleotide arrays. In Affymetrix chips short 
single stranded DNA sequences (25 nucleotides) are 
grown in situ using photolitographic techniques. As a sin- 
gle probe of just 25 nucleotides may not provide enough 
specificity for a reliable measurement of the gene expres- 
sion level, a set of 10-16 probes (the probe set) comple- 
mentary to different regions of the same target sequence 
are present in the chip. For each perfect matching (PM) 
probe there is a sequence differing by a single nucleotide. 
These are referred to as mismatching (MM) probes and 
are used to quantify the effects of cross-hybridization . 

Most of the available software packages for the calcula- 
tion of gene expression levels from the fluorescence inten- 
sities rely on algorithms of purely statistical or empirical 
nature 0,01 • I n the P as t t w0 years, however, several algo- 
rithms based on physical properties of the hybridization 
process were proposed [H, IS H H> ■ The basic idea be- 



hind the latter approach is that the intensities are linked 
to the hybridization free energies for the formation of 
the probe-target duplexes. For instance, for equal target 
concentration in solution, binding to CG-rich probes will 
provide a stronger signal compared to CG-poor probes 
jjCG nucleotides are more strongly bound than AT pairs 

In this paper we investigate a set of controlled experi- 
ments known as Latin square experiments and performed 
by Affymetrix in the human HGU95a chipset. In these 
experiments some target sequences are added at con- 
trolled ("spike- in") concentrations on a background ref- 
erence solution. The target concentrations range from 
to 1024 pM increasing as a power of 2 and following the 
scheme depicted in Fig. [21 covering all concentrations 
of biological interest. Note that in the table experiment 
vs. targets of Fig. |2Jequal concentrations are found along 
the lower left-upper right diagonals, following thus a pat- 
tern known as Latin square. The data, which are pub- 




FIG. 1: DNA microarrays are based on the hybridization of 
surface-bound DNA probes (thick) with target sequences in 
solution carrying fluorescent labels (thin). Besides perfect 
matching probe-target pairs forming ideal duplexes, partial 
hybridizations, or mismatches are possible, although they are 
expected to be thermodynamically less stable. 
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FIG. 2: In the Latin square experiments some selected target 
sequences are added at known concentrations following the 
scheme indicated in the figure. Aflymetrix considered 14 dif- 
ferent concentrations ranging from to 1024 pM (picomolars) 
and varying by a factor 2. In the Experiment 1, for instance, 
the RNA target 1 is absent from the solution (0 pM), the 
target 2 is present at a concentration of 0.25 pM ... In this 
scheme all possible 14 concentrations for each of the 14 target 
sequences are explored in 14 different experiments. 



licly available from the Affymetrix web site are i m ~ 
portant references for testing new algorithms that calcu- 
late gene expression levels from "raw" microarrays data. 
Not surprisingly, due to their central importance, the 
Latin square data have been analyzed by several groups 
through various approaches HSHHHHHOIIa- 
Our analysis is based on a simple physical model of 
target-probe hybridization. Although the modeling of 
microarray data with the physics of hybridization has 
been followed by other groups in the past couple of years 
IU, 0, 13 j our approach differs from what has been done 
so far in the following ways: 1) We use the free energy 
parameters of formation of RNA/DNA duplexes in solu- 
tion, and not the DNA/DNA parameters as in 2) We 
include the analysis of mismatches. 3) We include the 
effect of target-target hybridization in solution. 

The latter effect turns out to be an essential feature 
of our approach: When target-target hybridization is ne- 
glected the fit of the experimental data is very poor for 
half of the 14 spike-in genes. On the contrary, when hy- 
bridization in solution is included we obtain good fits 
of the experimental data with a simple theory contain- 
ing four fitting parameters. The ultimate test of the va- 
lidity of our approach is through the analysis of scaling 
collapses: when plotted as a function of an appropri- 
ate rescaled thermodynamic variable, which depends on 
an effective temperature, on the hybridization free ener- 
gies and on the target concentration, the Latin square 
data for different experiments tends to collapse into a 
single master curve. Although the noise level can still 
be large, significant deviations from this master curve 
are very rare. As we shall see, the deviations from the 
expected isotherm can be understood in several cases. 

This paper is organized as follows: in Section[n]we dis- 



cuss the thermodynamic parameters used in this paper, 
fn Sec. I11II we present the analysis of the Latin square 
data and the model used. In Section Hvl we present few 
examples of probes deviating from the expected behavior 
and discuss the origin of these deviation. Section con- 
cludes the paper with an overview of the results, open 
issues and a discussion of related works. 



TABLE I: The stacking free energy parameters AG37 for 
RNA/DNA hybrids measured in solution at a salt concen- 
tration 1 M NaCl and T = 37° C I4f. The upper strand 
is RNA (with orientation 5'-3') and lower strand DNA (ori- 
entation 3'-5'). Between parenthesis we give the DNA/DNA 
parameters. 
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rAA 
dTT 


1.0 


(1.00) 


rAC 
dTG 


2.1 (1.44) 


rAG 
dTC 


1.8 


(1.28) 


rAU 
dTA 


0.9 (0.88) 


rCA 
dGT 


0.9 


(1.45) 


rCC 
dGG 


2.1 (1.84) 


rCG 
dGC 


1.7 (2.17) 


rCU 
dGA 


0.9 (1.28) 


rGA 
dCT 


1.3 


(1.30) 


rGC 
dCG 


2.7 (2.24) 


rGG 
dCC 


2.9 


(1.84) 


rGU 
dCA 


1.1 (1.44) 


rUA 
dAT 


0.6 


(0.58) 


rUC 
dAG 


1.5 (1.30) 


rUG 
dAC 


1.6 


(1.45) 


rUU 
dAA 


0.2 (1.00) 
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II. THERMODYNAMICS OF RNA/DNA 
HYBRIDS 



The thermodynamics of duplex formation of nucleic 
acids in solution is well described by the nearest neighbor 
model according to which the free energy difference be- 
tween a duplex and two separated strands is given by the 
sum of the local terms which keep into account hydrogen 
bonding and base stacking |lfj|. In melting experiments 
in solution one usually determines AH and AS the en- 
thalpy and entropy differences between a duplex and two 
separate strands, from which the free energy difference 
AG = AH - TAS is obtained. The Table Q gives the 
free energy parameters at 1 M of NaCl and at T = 37° 
C (data taken from Ref. Q). The calculation of AG 
for a given sequence is obtained by summing up the data 
on Table [I] and adding to that a contribution of helix 
initiation AGfj*- = 3.1 kcal/mol 14]. 

The thermodynamic parameters for RNA/DNA hy- 
brids containing a single mismatch have recently been 
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determined 15]. The simple nearest neighbor model with 
stacking free energy parameters is no longer accurate for 
mismatches. For RNA/DNA single mismatches it has 
been found that the trinucleotide model, in which dis- 
tinct free energies are associated to the triplet formed by 
the mismatch and the two neighboring nucleotides, fits 
the experimental data reasonably well [15|]. As a MM 
probe in Affymetrix chips is realized by interchanging C 
with G and A with T in the middle nucleotide of a PM 
probe, there are four types of mismatches rGdG, rCdC, 
rUdT and rAdA. Taking into account the four possible 
combinations of neighboring nucleotides there are thus 
in total 64 different mismatches that should be consid- 
ered. The free energy of only part of these 64 triplets 
can be found in the present literature The full list 
of mismatch free energies used in this paper is given in 
the Appendix 1X1 



III. LATIN SQUARE DATA 

Usually, for the intensities / measured in the 
Affymetrix experiment one distinguishes the two contri- 
butions from non-specific (N) and specific (S) hybridiza- 
tions 31 . We follow the same idea here and write: 



7(c, AG) = N + S(c, AG) + e 



(1) 



where e denotes some experimental noise. Here I is the 
intensity from the probe whose complementary RNA tar- 
get is in solution at a concentration c (known for the 
"spike-in" genes) and AG the hybridization free energy. 
Note that we did not make any distinction between PM 
and MM probes, as we assume that their specific binding 
will depend only on c and AG. The non-specific hy- 
bridization N depends on the total RNA concentration 
in solution and possibly on other free energy parameters 
describing the partial matching with all RNA in solu- 
tion. However, the precise form of N is not relevant for 
the analysis performed in this paper, as we focus here on 



AI = 1(c) - 1(0) w S(c, AG) 



(2) 



The background subtraction is only possible in the Latin 
square set as there is always a reference measurement at 
zero spike-in concentration. The problem of calculating 
the "background" (N in Eq. QJ) from first principles 
approaches will be addressed elsewhere. 

As in Ref. p| , we model the specific hybridization as a 
two-state process where the target is either unbound in 
solution or fully hybridized to the probe forming a 25 nu- 
cleotides double helix at the surface, with one mismatch 
at position 13 for the MM probes. The Langmuir model 
predicts that: 



S(c,AG) 



Ac 



-0AG 



(3) 



1 + ce -/3AG 

here (3 = 1/ RT, where T is the temperature and R = 1.99 
cal/mol • K is the gas constant. The parameter A sets the 



scale of the intensity and corresponds to the saturation 
value in the limit where c 3> e' 9AG ', i.e. where the con- 
centration is high or the binding is strong. 
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FIG. 3: Signal intensities versus the bulk hybridization free 
energy for the probe set 1708_at, "spiked-in" at concen- 
tration 256 pM (a) and 1024 pM (b). The data are ob- 
tained from the Affymetrix experiments 1521m99hpp_av06 
and 1521a99hpp_av06, respectively. Filled and empty symbols 
refer to PM and MM probe sets. The solid lines are curves 
from Eq. ||3J where c is given in the experiment, A = 10 4 and 
for three values of the temperature (the arrow in (a) indicates 
the direction of increasing temperature). The inset shows a 
plot for a replicate of the 1024 pM experiment taken from the 
file 1532a99hpp_av04. Notice that in the latter the intensity 
of the PM probe 13 (indicated by the arrow) agrees very well 
with the Langmuir isotherm. 



A. High "spike-in" concentrations 

It is convenient to analyze first the limit of high "spike- 
in" concentrations, which we find to correspond to c > 
256 pM. At such high concentrations typically 1(c) 
1(0) thus the contribution of the background signal can 
be safely ignored. It is therefore equivalent to plot 1(c) 
or AI(c), as defined in Eq. (J2J). 

Figure |2| shows a plot of I vs. AG for the probe set 
1708_at for the concentrations of 256 (a) and 1024 pM 
(b). Both PM and MM probes (filled and empty sym- 
bols) are shown. The numbers label the probes, following 
the notation chosen by Affymetrix. For the MM probes 
we could calculate 3 out of 16 free energies, using the 
data given in Table ITTT1 Although fluctuations are quite 
strong, the intensities shown in Figure [3] tend to align to 
a single master curve both for PM and MM probes. 

The solid lines in Figure are plots of the Langmuir 
isotherm given in Eq. J3J . Assuming that the probe den- 
sity is roughly constant for the whole array, we expect 
that the value of the saturation amplitude A is the same 
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FIG. 4: As in Figure^Jfor the probe set 40322_at. The data 
are obtained from Affymetrix experiments given in the files 
1521b99hpp_av06 (a) and 1521d99hpp_av06 (b). 



for all probes. We analyzed the histograms of intensi- 
ties for the whole set of Latin square experiments and 
found that the probability of finding an intensity I drops 
sharply beyond I ma x ~ 10 4 , therefore we fix in the whole 
paper A = 10 4 . As the concentration c is known, the only 
free parameter is the temperature T. We find that best 
fits of the Langmuir isotherm with the experimental data 
are obtained for a temperature T — 700 ± 30°if , roughly 
twice as large as the temperature in the Affymetrix ex- 
periment, which is of T = 45°G w 320°^. 

The "discrepancy" between fitted and experimental 
temperatures deserves some discussion. We have esti- 
mated AG from a two state model summing up over the 
stacking parameters of Table |U In reality the binding of 
a target with a PM or MM probe can also involve fewer 
nucleotides. Moreover, the photolitographic process is 
not perfect and the surface bound probes have varying 
lengths (see Ref. ^j). These remarks indicate that the 
binding free energies are lower than those we have esti- 
mated on the basis of a simple two state process assum- 
ing that all probes have a fixed length of 25 nucleotides. 
However, we note that in plots of intensities versus AG, 
the hybridization free energies calculated from Table [I] 
the experimental data tend to align along a single master 
curve, as shown in Fig. Eland in the rest of the paper. 
This suggests that AG is a good thermodynamic variable 
to parametrize the experimental data. The fact that the 
data follow a Langmuir isotherm suggests also that dif- 
ferences with the true hybridization free energy in the 
array can be reabsorbed in a rescaling of the tempera- 
ture, as AG enters in the analysis through a Boltzmann 
weight exp(—/3AG). An "effective" temperature of about 
700° K implies that on average AG ar ray ~ AG so i/2. Be- 
ing an "effective" parameter T should not be compared 
directly to the experimental value. More important, for 



the purposes of this work, is the stability of T as a fitting 
parameter: our analysis indicates that T = 700° K fits 
rather well the experimental data for different probe sets 
and spike-in concentrations. 

A rather high effective temperature (T — 2100° K) 
was found in the fit of the Latin square data of Ref. |8j. 
The difference between our estimate and that of Ref. [2j 
is due to a different free energy parametrization (we use 
the more appropriate RNA/DNA values) and a differ- 
ent fitting procedure. Here we focus on fits of Langmuir 
isotherms as function of AG, rather than as function of 
the concentration as done in 8J. These issues are dis- 
cussed in Appendix iBl 

In Figure |2| one notices the presence of few "outliers", 
i.e. those probes whose intensities strongly deviate from 
the Langmuir isotherm, for instance as the probe 13 in 
Figure E2b) . The inset of Figure [3] shows a replicate of 
the experiment at a concentration of 1024 pM. In that 
case the intensity of probe 13 is in agreement with the 
Langmuir isotherm. The intensities from the probes 1 
and 2 instead deviate systematically from the Langmuir 
isotherm in all replicates of the 256 pM and 512 pM ex- 
periments. The origin of these deviations is discussed 
below. 

Figure 0] shows the intensities for the probe set 
40322_at for "spike-in" concentrations of 256 pM (a) and 
for the probe set 37777_at at a concentration of 1024 pM 
(b) . Again the trend of the PM and MM data is to align 
into a single master curve fitting quite well Eq. J5J|, when 
the same effective temperatures as in Figure are used. 
Similar behavior is found for the other spike- in genes |17| . 

B. General case 

In order to test the global functional form of the Lang- 
muir isotherm we turn now to the analysis of the full 
range of concentrations. From Eq. J3J) we expect that the 
experimental data should "collapse" into a simple master 
curve when plotted as a function of the scaled variable 
x = c exp(— /3AG). A preliminary analysis at various 
temperatures at around T = 700° K shows that the best 
fits are obtained for an effective temperature T — 680° 
K, which we fix now once for all. 

Figure shows a plot of AI, as defined in Eq. @, vs. 
x for the probe set 37777_at. Note that the large major- 
ity of probes follow indeed the Langmuir isotherm which 
takes the form Ax/ (1 + x), and which is shown as a solid 
line in Figure Only the intensities of the probe 16 de- 
viate substantially from it. Quite interestingly, probe 16 
still follows a Langmuir isotherm shifted along the hori- 
zontal axis. This shift is equivalent to a probe-dependent 
rescaling of the variable x. One can thus collapse all the 
data onto the curve Ax' /(I + x') by plotting AI as a 
function of 

x' = a k c e~^ G , (4) 
with ak probe dependent. For instance, for the probe set 
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FIG. 5: Intensities for the Latin square experiment (set 1521 
[TH ]') for the probe set 37777_at plotted as function of the 
rescaled variable x = cexp(/3AG). The probe numbers for 
both PM (smaller characters) and MM (bigger characters) 



are given. 
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FIG. 7: Example of reduction of the effective concentration 
in solution of the target sequences due to hybridization with 
other RNA fragments in solution (2) or due to secondary 
structure formation (4). For probes 2 and 4 the target se- 
quences available for hybridization is decrease. We model 
this effects with a simple function a decreasing exponentially 
in the RNA/RNA hybridization free energy [Eq. JSJ]. 
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FIG. 6: Plot of A7 vs. x = cexp(-/3AG) for the probe set 
1024_at. 

37777_at one could take a,k 1 for all probes except for 
probe 16 for which a 16 ss 10~ 3 . While in Figure [5] only 
one probe deviates sensibly from the Langmuir isotherm, 
in other cases the disagreement involves the majority of 
the probes. An example is given in Figure^ which shows 
a plot of AI vs. x for the probe set 1024_at. We note 
that the shift along the s-axis is predominantly to the 
right side of the Langmuir isotherm, corresponding to a 
rescaling parameter ctk < 1. An analysis of all the 14 
"spike-in" genes shows that half of them are quite well- 
behaving in the sense that most of the data on a plot of 
AI vs. x align along the Langmuir isotherm, as in Figure 
|SJ The remaining half resembles more the example of Fig- 
ure |H1 (all figures are shown in H3)- A closer inspection 
to these defective probes shows that most of them have 
a rather high hybridization free energy, typically larger 
than 30 — 35 kcal/mol. 

A rescaling factor a < 1 can also be interpreted as a 
lowering of the target concentration in solution c = ac. 
The most plausible explanation of this reduction of the 
concentration is the target-target hybridization in solu- 
tion, or RNA secondary structure formation, as schemat- 



FIG. 8: Plot of the rescaling factor a needed to shift the 
data points to the Langmuir isotherm Ax/(1 + x) as function 
of the RNA/RNA hybridization free energies for each probe. 
The solid line is the fit to the Eq. © as expected from a 
simple model of bulk hybridization. The circles and diamonds 
emphasize the data from the probe sets 408_at and 36889_at 
which contain some defective probes. All these probes tend 
to deviate more strongly from the average behavior. 

ically illustrated in Figure The figure shows an exam- 
ple four 25 nucleotides long regions of the target RNA, 
which are complementary to probes 1, 2, 3 and 4 of some 
probe set. The regions richer in CG, which have therefore 
higher hybridization free energy (in the example 2 and 4) , 
tend to form stable duplexes with other RNA fragments 
or to form some secondary structure. Once hybridization 
in solution has occurred the amount of target RNA avail- 
able for hybridization to the probe sequences is reduced. 

Figure[S]shows the reduction of the effective target con- 
centration ckfe for all the "spike-in" genes of the experi- 
ments 1521 as a function of the free energy of RNA/RNA 
duplex for each probe. The parameter evj; is determined 
from the distance of the experimental data to the Lang- 
muir isotherm Ax /{I + x) in AI vs x plots. In Figure|HI 
the data follow two different behaviors below and above 
45 kcal/mol. For low hybridization free energies au is 
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FIG. 9: Intensities for the probe sets 1024_at (a) and 37777_at (b) plotted as functions of the scaled variable x' — acexp(/3AG), 
which takes into account of a, the fraction of target sequences hybridizing in solution. As a comparison we show in the insets 
the same quantity plotted as a function of x — cexp(/3AG). 



constant and roughly equal to 1, indicating that those 
regions of the target RNA are single stranded and avail- 
able for binding to the probes. For free energies larger 
than 45 kcal/mol ak diminishes following approximately 
an exponential decay, due to the possible effect of en- 
hanced hybridization in solution. We stress that the 
free energies shown in Figure [5] are for RNA/RNA du- 
plexes and these are typically stronger than RNA/DNA 
or DNA/DNA counterparts. We fit the global behavior 
of Qfe with the following equation 



1 



l + gcxp(-/3'AG R ) 



(5) 



where AGr is the RNA/RNA hybridization free energy 
in solution. The best fit of the data is shown as a solid line 
in Figure il which leads to c w 10 -2 pM and /?' = 0.67 
mol/kcal, i.e. T = 725° K. 

The Eq. JSJ) resembles that for a two state process in 
which the target RNA reacts with a fragment with con- 
centration c. In reality there are many different match- 
ing fragments hybridizing with the same target region. 
One should not view c as a real concentration, rather the 
whole cexp(— /3AGr) as a global relative probability for 
hybridization in solution, which is obtained by averaging 
over all these processes. 

Having now fixed the four fitting parameters A, T, 
c and T', we can reanalyze the data collapse by using 
as a scaling variable x' — akcexp(/3AG), with ak given 
in Eq. ||SJ. Figure |5] shows the plot of AI with the 
new scaling variable for the probe set 1024_at (left) and 
37777_at (right). Notice the nice collapse of all PM and 
MM intensities into a single master curve, now in much 
better agreement with the Langmuir isotherm. Similar 
plots for all "spike-in" genes show equally good collapses 
(plots for all 14 genes of the Latin square set are given 
m 0). 

We can use the proposed model to fit the experimen- 
tal data keeping the absolute target concentration as the 
only fitting parameter. A plot of the fitted concentra- 
tion as a function of the spike-in concentration is given 
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FIG. 10: Fitted concentration vs. spike-in concentration for 
the probe set 1024_at. The solid line is y — x, while the 
dashed lines correspond to y = x/2 and y = 2x. 



in the inset of Fig. The fitted concentration is, apart 
from the region c s < 1 pM, within a factor two from 
the spike-in value c s . A result which compares favorably 
with other algorithms [aBlyl (f° r more details see |17|). 



IV. DEVIATIONS FROM THE MODIFIED 
LANGMUIR ISOTHERM 

An analysis of the 14 spike-in genes reveals that there 
are still a few probes deviating from the expected be- 
havior of the modified Langmuir isotherm Ax' /(l + x'), 
which takes into account the target-target hybridization 
in solution. Figure UTI shows two examples of such devia- 
tions: (a) the probe 9, both PM and MM, of the probe set 
1091_at and (b) the probe 10 of the probe set 36202_at. 
These deviations are systematic as they are observed in 
other replicates of the Latin square experiments and at 
all concentrations. Note that the large majority of the 
probes are in quite good agreement with the Langmuir 
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FIG. 11: Examples of deviations from the rescaled Langmuir isotherm Ax' /(I + x'). (a) The probe 9 of the probe set 1091_at 
has substantially lower signal than that expected, (b) The probe 10 of the probe set 36202_at has a significantly higher signal 
than that expected from the Langmuir isotherm. The insets of (a) show the intensities for the three "defective" probes, which 
do not align against the GenBank entries BC013368.2 and AL833563.1. All these probes have lower intensities than expected 
from the Langmuir curve Ax' /(l + x') (upper inset). A recalculation of the hybridization free energies for these probes leads 
to a horizontal shift of the data, which are much closer to the Langmuir isotherm (lower inset). 



isotherm as in the examples shown in Fig. [5] The devi- 
ations typically involve just one probe per probe set and 
they are observed in very few of the 14 spike-in genes of 
the Latin square set. 

It is very instructive to look at these deviations more 
in detail. We performed a systematic sequence align- 
ment against the whole human genome stored in public 
data banks (as GenBank) for all 14 probe sets of the 
Latin square experiments. Affymetrix arrays are pro- 
duced by photolitographic techniques and each probe is 
synthesized in situ using the sequences taken from Gen- 
Bank. However, the GenBank entries are continuously 
updated, and in some sequences errors ma y b e present. 
The Affymetrix NetAffx™ Analysis center [l8j , provides 
information on the GenBank entries used to design the 
probe sequences. 

Let us discuss first the probe set 1091_at. NetAffx™ 
indicates that this probe set was obtained from the Gen- 
Bank entry M65066.1. A sequence alignment indeed 
shows that all the probe set sequences for the 1091_at 
match fully with the GenBank entry M65066.1. The 
alignment also shows that the two other sequences with 
GenBank entries BC013368.2 and AL833563.1 match 
perfectly with 13 of the 16 probes of probe set 1091_at, 
while the match is only partial for the probes 1,2 and 
9. The Table [n] summarizes the results of the align- 
ment for the probe set 1091_at. The difference is a sin- 
gle nucleotide close to the 5' and 3' ends for the probes 
1 and 2, while there are 5 mismatching nucleotides for 
the probe 9. Note also that the GenBank sequence 
M65066.1 dates from 1994 (see Table 0, while the two 
other entries are much more recent. We therefore sus- 
pect that the entry M65066.1 contains some annotation 
errors. As Affymetrix probe sequences are obtained by 
public databases, which are constantly updated, inconsis- 
tencies between probes and actual mRNA sequences may 



be present in some GenBank entries. If we assume that 
BC013368.2 and AL833563.1 contain the correct mRNA 
sequence, then the hybridization free energies that were 
used in Fig. Illf a) for the probes 1, 2 and 9 are overes- 
timated and need to be corrected. Note that the three 
probes 1,2 and 9 have all intensities lower than expected 
from the Langmuir model as shown in the upper inset of 
Fig. Eta). 

Before discussing the free energies corrections, we re- 
call that the Affymetrix RNA target in solution is actu- 
ally anti-sense RNA, complementary to the usual mRNA 
sequences. Therefore the surface-bound probes have the 
same sequences as mRNA's, apart from the substitution 
of U with T. 

Probe 1: The new mRNA annotation from the se- 
quences BC013368.2 and AL833563.1 of Table HU im- 
plies that a CA mismatch with a triplet rACA/dTAT is 
formed when the target RNA hybridizes with the probe 
1. There is no information in the present literature [l5| 
about this mismatch triplet, therefore we cannot assign 
a free energy to it. We notice anyhow that the mismatch 
occurs very close to the 5' end of the probe sequence (see 
Table ITTfl - which is the "free" end as the probes are linked 
to the substrate at their 3' end. It is plausible that close 
to its free end the double helix RNA/DNA can be sub- 
stantially distorted without a large penalty in free energy. 
The bases in the mismatch rC/dA should be still able to 
form two hydrogen bonds, therefore we consider likely 
that this mismatch does not affect substantially the hy- 
bridization free energy. This is a plausible explanation of 
the fact that the probe 1 (both PM and MM) deviates 
only slightly from the Langmuir isotherm, as shown in 
the inset of Fig. Illf a) . 

Probe 2: The new annotation implies an extra mis- 
match of the type rGCC/dCCG, for which no free en- 
ergy has been given in the literature. The free energy 
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TABLE II: Best alignments for the probes 1, 2 and 9 of the probe set 1091_at. For each probe the first line is the sequence found 
in the Affymetrix chip, which aligns perfectly with the sequence with GenBank entry M65066.1 (second line). The sequences 
with GenBank entries BC013368.2 and AL833563.1 align perfectly with 13 of the probes in the probe set 1091_at, but they 
have some differences with the probes 1, 2 and 9. The differing nucleotides are underlined. The last column of the Table shows 
the sequence submission date to the GenBank. 

Probe Origin Sequence GenBank Date 

Affymetrix 5'-TATGAGATTGATCTTGCCCCTAATT-3' 

1 Blast 1 5'-TATGAGATTGATCTTGCCCCTAATT-3' M65066.1 10-NOV-1994 
Blast 2 5'-TGTGAGATTGATCTTGCCCCTAATT-3' BC013368.2 19-NOV-2003 

Blast 3 5'-TGTGAGATTGATCTTGCCCCTAATT-3' AL833563.1 13-MAY-2003 

Affymetrix 5'-GCAGAAGTCAAGCCAGCCGCGGCCC-3' 

2 Blast 1 5'-GCAGAAGTCAAGCCAGCCGCGGCCC-3' M65066.1 10-NOV-1994 
Blast 2 5'-GCAGAAGTCAAGCCAGCCGCGGGCC-3' BC013368.2 19-NOV-2003 

Blast 3 5'-GCAGAAGTCAAGCCAGCCGCGGGCC-3' AL833563.1 13-MAY-2003 

Affymetrix 5'-CTGTCCTTGGTCCG__CATGGCTCGTT-3' 

9 Blast 1 5'-CTGTCCTTGGTCCG__CATGGCTCGTT-3' M65066.1 10-NOV-1994 

Blast 2 5'-CTGTCCTTGGTCCGAGGCTGCTCGTT-3' BC013368.2 19-NOV-2003 

Blast 3 5'-CTGTCCTTGGTCCGAGGCTGCTCGTT-3' AL833563.1 13-MAY-2003 



differences between perfect triplets and triplets with a 
CC mismatch, as given in Figure El suggest as a rough 
estimate for the CC mismatch of about 4.5 kcal/mol. 
This causes a shift of the data toward a lower value of 
the variable x' and shifts them much closer to the curve 
Ax'/(l + x'). 

Probe 9: For the probe 9 the alignment of the sequence 
M65066.1 with those in BC013368.2 and AL833563.1 dif- 
fers the most. The duplexes formed in this case are as 
shown in Figure lT^l and contain an inner asymmetric loop 
with two arms with 4 and 5 nucleotides. It is difficult to 
evaluate the hybridization free energies for these config- 
urations. A rough estimate, taking AG loop = 0, yields a 
free energy shift of 8 — 10 kcal/mol. 

We have used the estimated free energies to correct 
for the x' variables for the probes 2 and 9. For instance 
a shift of 8 kcal/mol for the probe 9 implies a correc- 
tion factor of exp(— 8/RT) w 3 • 10~ 3 , where we have 
used T — 700° K. Analogously for the probe 2 we find 
exp(-i/RT) w 0.05. The insets in Fig. IT2Tal show 
the intensities for the three probes with conflicting align- 
ments in the case of un-normalized data (top) and data 
with rescaled factors for the probes 2 and 9 (bottom). In 
the latter case the agreement with the Langmuir isotherm 
is substantially improved. 

In order to assess on the possible frequency of annota- 
tion errors we have performed an alignment analysis of 
all probe sets of the Latin square set (for more details 
see 0| )• The only potential annotation problems were 
those detected for the probe set 1091_at, which suggests 
that these errors should not be too frequent, at least in 
the human genome. 

We turn now to the probe 10 of Fig. Illf bh which has a 
signal significantly higher compared to the Langmuir pre- 
diction. In the whole set of Latin square data we found 
only another example of a similar high signal, namely 



"Perfect Match" "Mismatch" 

d5 ' -. . . .TCCGCATGGCT. . . .-3 ' d5 ' -. . . .TCGGCATGGCT. . . .-3 ' 

ill III mi mi 

r3'-....AGGC CGA....-5' r3'-....AGGC CGA....-5' 

UCCGA UCCGA 

FIG. 12: Expected duplexes formed with the "defective" 
probe 9 of the set 1091_at for the PM and MM signal, if the 
mRNA sequence is taken from the entries BC013368.2 and 
AL833563.1. The upper strand is the surface-bound DNA, 
while the lower one is the RNA target. 



the probe 16 of the probe set 36085_at. Analyzing these 
two sequences we find that they share a common fea- 
ture: both are A-rich close to the 3' end. The sequences 
are . . . CACAAAAG-3' (36202_atl0) and . . . CAATAAA- 
5' (36085_atl6). Note that the Table HI gives for the com- 
bination rUU/dAA the lowest free energy. A possible 
improvement of the data collapse could be obtained by 
introducing position-dependent weights Wi with i = 1, 2 
... so that 

24 

AG = AG init - + W&Gi (6) 

i=l 

where AGi is taken from Table [I] and the sum is ex- 
tended over the nucleotides of the target sequence. One 
could assume that far from the substrate iu, w 1, so that 
the hybridization free energy reaches the solution limit 
and that uij < 1 close to the substrate. A reweighting 
of this type lowers the global value of AG (and conse- 
quently decreases the effective fitting temperature) and 
could lower the anomalously high signal of the two probes 
A-rich at their 3' end. It is not clear a priori which func- 
tion to choose for the weight Wi , apart from the fact that 
it should increase far from the surface, and different pos- 
sibilities will be explored elsewhere. We stress however 
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that the choice vii = 1, used in this paper, provides al- 
ready a quite good fit of the experimental data except in 
very few cases. 

V. DISCUSSION 

In this paper we have analyzed a series of controlled 
experiments on Affymetrix microarrays using a simple 
model of RNA/DNA hybridization. We have shown that 
for each probe (PM and MM) of a probe set the intensi- 
ties plotted as a function of the free energy of RNA/DNA 
hybridization tend to align along a single master curve in 
quite good agreement with a Langmuir isotherm. In fact 
the intensities of half of the "spike-in" genes are already 
well fitted with the simple form given in Eq. ©. For the 
other half, those containing probes with higher CG con- 
tent, we found that one has to include the effect of target 
hybridization in solution, which diminishes the effective 
concentration of single stranded RNA sequences in solu- 
tion. This effect is well described by a simple analytical 
form given by Eq. J5|. Despite this very simplified model 
the fit with the experimental data is very satisfactory, as 
shown by the scaling collapses (i.e. plots of intensities as 
function of the rescaled variable x' of Eq. I^jj). Although 
the data are somewhat noisy the calculations of the tar- 
get concentrations are in good agreement with the input 
spike-in values (see Fig. HUIand [17|). This is due to the 
fact that concentrations are obtained from averaging over 
the signal of each individual 16 PM probes and of the MM 
probes of which we were able to include in the analysis. 
The averaging over these data points yields quite robust 
and reliable estimated of target concentration values. 

The feature that is still missing in our analysis is the 
calculation of the background level (N in Eq. 0J). We 
circumvented this problem by subtracting from the in- 
tensities those measured at zero spike-in concentration. 
This is only possible for the Latin square data. A good 
estimate of the background level could help in improv- 
ing the quality of the fits in the low concentrations limit. 
These issues will be considered elsewhere. 

The physics of the hybridization in high density mi- 
croarrays has been investigated recently by other groups. 
We comment now on the differences between the present 
approach and what has been done in the literature so far. 
In a recent paper Hekstra et al. found nice agreement 
of the Latin square data with a Langmuir model. Their 
plots of rescaled intensities versus rescaled concentrations 
follow very well the curve x/(l + x), with small fluc- 
tuations. For their rescaling they use probe-dependent 
values, a procedure which requires the use of 24 fitting 
parameters. The advantage of our approach is that we 
find good collapses of the experimental data using a very 
simple model with only few fitting parameters. 

It has been recently claimed that the MM probes do 
not follow the behavior predicted by the standard hy- 
bridization theory 19]. Our analysis, instead, shows 
that MM probes intensities follow the same Langmuir 



isotherm as the PM probes. For the mismatches we used 
the trinucleotide free energies for RNA/DNA duplexes 
in solution [l5|. A very important aspect of the MM 
hybridization, as highlighted in studies of RNA/DNA 
duplexes melting in solution |l5j . is that their free en- 
ergy strongly depends on the type and order of the two 
nucleotides close to the mismatch. This is probably the 
reason why an analysis of the mismatches based on single 
base pairs energies, as in Ref. 0, shows deviations from 
the Langmuir isotherm of the PM probes. We believe 
that the strong dependence of the mismatch free energy 
on the two neighboring nucleotide is a very important 
aspect for the correct modeling of the hybridization of 
MM probes. 

In Ref. 0, the Langmuir model for target-probe hy- 
bridization was used to fit Affymetrix Latin square data. 
The hybridization free energies were obtained from val- 
ues of DNA/DNA duplexes in solution [2(j, and not for 
RNA/DNA duplexes, as we have done here. As discussed 
in Section m there are some differences between the two 
sets of parameters. An explicit example emphasizing the 
influence of these differences for the fitting procedure is 
shown in the Appendix FBI Our results show that a cor- 
rect free energy parametrization improves substantially 
the quality and stability of the fits. 

In other studies , the free energies were obtained 
directly from a fit of Affymetrix data assuming a input 
relationship I (AG). The binding free energies were taken 
dependent on the distance from the substrate 00 , while 
we have so far calculated free energies by summing up 
uniformly over all the stacking energies of the probe se- 
quences. As pointed out before, a position dependent 
weight in the free energy calculation may improve the 
quality of our data collapses for those few A-rich probes 
close to the substrate which we found to deviate more 
strongly from the Langmuir model. The overall quality 
of the fits remains however quite good also in absence of 
position-dependent binding (see Ref. 

Another effect which has been claimed to be rele- 
vant for hybridization in high density DNA microarrays 
is the Coulomb interaction between a highly negatively 
charged surface DNA layer and negatively charged tar- 
get molecules 0, |2^. These effects may play a role 
for a system with monodisperse probe length distribu- 
tion. However, in Affymetrix chips the probe lengths are 
widely distributed [Ifj, An analysis of the electrostatic 
interaction |17|. shows that its strength is much weaker 
compared to that of the systems studied in Refs. plll^. 
It is thus possible to neglect electrostatic effects, as we 
did here and as done in other studies involved with the 
physical modeling of hybridization in Affymetrix arrays 

EE00. 

Finally, one may wonder how representative the spike- 
in targets chosen by Affymetrix are for the overall be- 
havior of the microarray. A recent investigation p3 ] of 
several human housekeeping genes (i.e. those which are 
expressed in virtually all tissues) and of the Affymetrix 
spike-in data for the chipset HGU133 shows that the in- 
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TABLE III: AG37 for triplet mismatches in RNA/DNA du- 
plexes, where the upper strand is RNA (the orientation is 3' 
to 5' from left to right) and lower strand DNA. Only the mis- 
matches which are realized in the Affymetrix chip are shown. 
a: Deduced from the rUdG mismatches, b: Deduced from the 
rAdA mismatch, c: Deduced from the rCdT mismatches, d: 
Deduced from the rCdA mismatches. Between parenthesis is 
the free energy of a perfectly matching triplet obtained by in- 
terchanging C with G and A with T in the central nucleotide 
of the DNA strand. 



Sequence AG37(kcal/mol) 


Sequence AG37(kcal/mol) 


GG-mismatches 

f G %% 0.11 (-4.6) 

dccc - L 26 (-5.8) 

0.48° (-3.1) 

fall 1-24" (-2.8) 


dSoS "0-97 (-4.4) 
dCGG "2-25 (-5.6) 
Jtgg -0-62° (-4.5) 
dAGC 0-67° (-4.5) 


A A- mismatches 

foTc 1-05 (-2.7) 

f£ G c 0.20 (-3.1) 


fcTo 0.32 (-3.0) 
IcaS 0.29 (-3.4) 


UT- mismatches 

1.05 6 (-2.5) 
dCTC 0.44 (-2.7) 


£™ 0.25 (-2.4) 
S™g -0.22 (-2.6) 


CC-mismatches 

docc °- 73C (- 3 -8) 

£gG . 2 g d ( _ 4 . 4) 


S - H.2) 
£SS ~ (-4-8) 



tensities within a given probe set follow a distribution 
which is very similar to that observed in this work. 
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FIG. 13: Free energy differences between MM and PM triplets 
for the 18 mismatches given in Table UTTI where the labeling 
follows the same ordering as the Table. As an example the 
first point is the free energy difference between rCGG/dGGC 
and rCGG/dGCC, the second point the free energy difference 
between rCGC/dGGG and rCGC/dGCG ... 



APPENDIX A: FREE ENERGIES FOR 
MISMATCHES 

The Table lllll shows all the free energies for triplets 
with a single mismatch used in this paper. Part of these 
free energies are obtained from experimental results of 
Ref. I n some cases the free energies were deduced 

from the analogy with other mismatches. For instance, as 
pointed out in Ref. |l]| , the free energies for triplets with 
rAdA mismatches are close to those of triplets with rUdT 
mismatches. In absence of experimental determinations 
of mismatch free energies for the triplet rCUG/dGTC, 
we assign to the latter the same free energy as the mis- 
match rCAG/dGAC, which is of 1.05 kcal/mol. Of the 
18 triplet free energies in Table 1TTT1 11 were obtained by 
direct experimental data inputs, while 7 from similarities 
with other mismatches. 

Another interesting quantity is the free energy differ- 
ence between a perfect matching triplet and one with a 
central mismatch, i.e. the free energy shift due to a sin- 
gle mismatch. Figure 1131 reports the free energy differ- 
ences for the 18 mismatches given in Table ITnl following 
the same order. These differences are obtained by sub- 
tracting from the data in Table IIIII the values between 
parenthesis, corresponding to a perfect matching triplet. 
The Figure EH shows that the free energy difference is 
quite sensitive to the type of mismatch and of its two 
neighboring nucleotides. 
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APPENDIX B: COMPARING RNA/DNA WITH 
DNA/DNA HYBRIDIZATION FREE ENERGIES 

The approach followed in Ref. uses hybridization 
free energy for DNA/DNA duplexes in solution, while 
throughout this paper we used RNA/DNA parameters, 
as in the Affymetrix Latin square experiments the target 
is composed by RNA, while the probes are surface-bound 
DNA sequences. In order to illustrate the difference in 
the intensity vs. free energy plots we show in Figure ITU 
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FIG. 14: Intensity for the probe set 36311_at for a target con- 
centration of 256 pM plotted as function of (a) DNA/DNA 
and (b) RNA/DNA hybridization free energies, (a) repro- 
duces Figure 5 of Ref. ||, and the solid line is the best fit 
with the parameters used in that reference; the probes 7 and 
8 (indicated as arrows) are considered as outliers. In the case 
(b) the solid lines are the Langmuir isotherms from Eq. (J^J, 
A — 10 4 , c = 256 pM and three values of the temperature. 

the fit to the Langmuir model the intensities of the probe 
set 36311_at (from the Affymetrix file 1521g99hpp_av06) 
when (a) DNA/DNA and (b) RNA/DNA hybridization 
free energies are used. The figure shows the same data 
of Figure 5 of Ref. 8] . The best fit obtained from the 
parameters of Ref. 8] is the thick solid line shown in 
Figure Et a )> where the probes 7 and 8, indicated by ar- 
rows, were considered as outliers. In Figure 1141^ the 



solid lines are the Langmuir isotherms with the same pa- 
rameters as in Figures |^a) and^a). By comparing the 
two plots one can conclude that the fit in Figure PHIL ), 
which is also consistent with plots for other probe sets 
(see Figs. |3]and|3}, is more convincing than that shown 
in the case (a). Note that in Fig. [H|L>) the four MM 
intensities follow the same behavior as the PM probes. 
The analysis of Ref. |8| is restricted to PM probes only. 

The difference between the fits performed here and 
those reported in Ref. || are also due to a different ap- 
proach to the analysis. In Ref. y| the intensities for each 
probe are fitted as a function of the concentration c us- 
ing the three adjustable parameters A, K = exp(— /3AG) 
and the background level N. Although a three parame- 
ters fit appeared to reproduce experimental data for dif- 
ferent probes very well @ one of the problems with this 
analysis is that A was found to vary over one order of 
magnitude from probe to probe. As this is an unphysical 
feature, A was then kept constant for all probes while 
the other two parameters were allowed to vary Q. Ref. 
@ reports as best fitted value A « 9 500, which com- 
pares favorably to our choice A = 10 4 . The problem 
is that the fits of Intensities vs. concentration obtained 
by binning over different free energies are less convincing 
(see Fig. 3 of Ref. Q). Analyzing then the decay of 
1/K as function of the DNA/DNA hybridization free en- 
ergy an effective temperature of T « 2 100° K is found. 
This is roughly three times higher of what we find in this 
paper, from a direct analysis of plots of intensities vs. 
RNA/DNA hybridization free energies. The use of an 
accurate free energy parametrization is very important: 
as AG is related to the intensity thorough an exponen- 
tial factor, small variations of AG estimates may have 
profound influences on the values and robustness of the 
fitting parameters. 
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